20 ◾ Bioinformatics
an input and generates quality assessment reports including per base sequence quality, per
tile sequence quality, per sequence quality scores, per base sequence content, per sequence
GC content, per base N content, sequence length distribution, sequence duplication levels,
overrepresented sequences, adapter content, and k-mer content. FastQC supports all variants
of the FASTQ format, including gzip-compressed FASTQ files.
For practice, we will download some public single-end FASTQ files from the NCBI BioProject
with the accession “PRJNA176149”. The SRA files of this project contain genomic single-end
reads of Escherichia coli str. K-12. To keep the files organized, create a directory named
“ecoli” with “mkdir ecoli”, change into it with “cd ecoli”, and then use any text editor to
save the following run IDs, one per line, in a text file named “ids.txt”:
SRR653520
SRR653521
SRR576933
SRR576934
SRR576935
SRR576936
SRR576937
SRR576938
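As an alternative to a text editor, the same file can be written directly from the shell. The following is a minimal sketch that assumes the current directory is already “ecoli”; it uses a here-document so the IDs are stored exactly as listed above, one per line.

```shell
# Write the eight run IDs listed above into ids.txt, one per line.
# Assumption: the current working directory is the "ecoli" directory.
cat > ids.txt <<'EOF'
SRR653520
SRR653521
SRR576933
SRR576934
SRR576935
SRR576936
SRR576937
SRR576938
EOF

# Print the number of IDs written as a quick sanity check.
wc -l < ids.txt
```

The quoted delimiter ('EOF') prevents the shell from expanding anything inside the here-document, which is a safe default even though these IDs contain no special characters.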
Then, run the following script to create the subdirectory “fastQC” and to download the
FASTQ files associated with the IDs stored in the “ids.txt” file into that subdirectory:
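# Create the output directory (no error if it already exists)
mkdir -p fastQC
# Download each run listed in ids.txt as FASTQ into fastQC
while read -r f; do
    fasterq-dump \
        --outdir fastQC \
        --progress \
        --threads 4 \
        "$f"
done < ids.txt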
Once the raw FASTQ files have been downloaded, we can use the command “ls -lh fastQC”
to display the file names as shown in Figure 1.10.
FIGURE 1.10 The names of the downloaded FASTQ files.
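Besides listing the directory, it can be useful to confirm that every ID in “ids.txt” actually produced a file. The sketch below defines a hypothetical helper, check_downloads, that is not part of the SRA Toolkit. It assumes each single-end run is written as “<ID>.fastq” in the output directory, which may differ if other fasterq-dump options are used.

```shell
# check_downloads IDS_FILE OUT_DIR
# Hypothetical helper: prints "missing: <ID>" for every ID in IDS_FILE that
# has no non-empty <ID>.fastq file in OUT_DIR (assumed naming scheme).
# Returns 0 when nothing is missing and a non-zero status otherwise.
check_downloads() {
    ids_file=$1
    out_dir=$2
    missing=0
    # "|| [ -n "$id" ]" also handles a last line without a trailing newline.
    while read -r id || [ -n "$id" ]; do
        [ -z "$id" ] && continue    # skip blank lines
        if [ ! -s "$out_dir/$id.fastq" ]; then
            echo "missing: $id"
            missing=$((missing + 1))
        fi
    done < "$ids_file"
    [ "$missing" -eq 0 ]
}

# Example call (run from inside the "ecoli" directory):
#   check_downloads ids.txt fastQC
```

If the function prints nothing and returns successfully, every listed run has a corresponding non-empty FASTQ file. Any reported ID can be downloaded again by running fasterq-dump for that ID alone.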